1. Introduction

In probability and statistics, a characterization theorem occurs whenever a given 
law or a given class of laws is the only one which satisfies a certain property. 
While probabilistic characterization theorems are concerned with distributional 
aspects of (functions of) random variables, statistical characterization theorems 
rather deal with properties of statistics, that is, measurable functions of a set 
of independent random variables (observations) following a certain distribution. 
Examples of probabilistic characterization theorems include 

- Stein-type characterizations of probability laws, inspired from the classical 
Stein and Chen characterizations of the normal and the Poisson distributions, 
see Stein (1972), Chen (1975); 

- maximum entropy characterizations, see Park and Bera (2009); 

- conditioning characterizations of probability laws such as, e.g., determining 
the marginal distributions of the random components X and Y of the vector 
(X, Y)' if only the conditional distribution of X given X + Y is known, see
Patil and Seshadri (1964). 

The class of statistical characterization theorems includes 

- characterizations of probability distributions by means of order statistics, see 
Galambos (1972) and Kotz (1974) for an overview on the vast literature on 
this subject; 

- Cramér-type characterizations, see Cramér (1936);

- characterizations of probability laws by means of one linear statistic, of iden- 
tically distributed statistics or of the independence between two statistics, see 
Lukacs (1956). 

Besides their evident mathematical interest per se, characterization theorems 
also provide a better understanding of the distributions under investigation and 
sometimes offer unexpected handles to innovations which might not have been
uncovered otherwise. For instance, Chen and Stein's characterizations are at the
heart of the celebrated Stein's method, see the recent Chen, Goldstein and Shao 
(2010) and Ross (2012) for an overview; maximum entropy characterizations are 
closely related to the development of important tools such as Akaike's Informa- 
tion Criterion, see Akaike (1977, 1978); characterizations of probability distribu- 
tions by means of order statistics are, according to Teicher (1961), "harbingers 
of [...] characterization theorems", and have been extensively studied around the
middle of the twentieth century by the likes of Kac, Kagan, Linnik, Lukacs or
Rao; Cramér-type characterizations are currently the object of active research,
see Bourguin and Tudor (2011). This list is by no means exhaustive and for 
further information about characterization theorems as a whole, we refer to 
the extensive and still relevant monograph Kagan, Linnik and Rao (1973), or 
to Kotz (1974), Bondesson (1997), Haikady (2006) and the references therein. 

In this paper we focus on a family of characterization theorems which lie 
at the intersection between probabilistic and statistical characterizations, the 
so-called MLE characterizations. 

1.1. A brief history of MLE characterizations 

We call MLE characterization the characterization of a (family of) probability 
distribution(s) via the structure of the Maximum Likelihood Estimate (MLE) 
of a certain parameter of interest (location, scale, etc.). 

The first occurrence of such a theorem is in Gauss (1809), where Gauss showed
that the normal (a.k.a. the Gaussian) is the only location family for which the
sample mean $\bar{x} = n^{-1}\sum_{i=1}^n x_i$ is "always" the MLE of the location parameter.
More specifically, Gauss (1809) proved that, in a location family $g(x - \theta)$
with differentiable density $g$, the MLE for $\theta$ is the sample mean for all samples
$x^{(n)} = (x_1, \ldots, x_n)$ of all sample sizes $n$ if, and only if, $g(x) = \kappa e^{-\lambda x^2}$ for some
$\lambda > 0$ and $\kappa$ the adequate normalizing constant. Discussed as early as in Poincaré (1912), this
important result, now known as Gauss' principle, has attracted much attention 
over the past century and has spawned a spree of papers about MLE char- 
acterization theorems, the main contributions (extensions and improvements 
on different levels, see below) being due to Teicher (1961); Ghosh and Rao 
(1971); Kagan, Linnik and Rao (1973); Findeisen (1982); Marshall and Olkin
(1993) and Azzalini and Genton (2007). (See also Hürlimann (1998) for an alternative
approach to this topic.) For more information on Gauss' original argument,
we refer the reader to the accounts of Hald (1998, pages 354-355) and
Chatterjee (2003, pages 225-227). See Norden (1972) or the fascinating Stigler
(2007) for an interesting discussion on MLEs and the origins of this fundamental 
concept. 

The successive refinements of Gauss' principle contain improvements on two 
distinct levels. Firstly, several authors have worked towards weakening the reg- 
ularity assumptions on the class of distributions considered; for instance Gauss 
requires differentiability of $g$, while Teicher (1961) only requires continuity. Secondly,
many authors have aimed at lowering the sample size necessary for the
characterization to hold (i.e. the "always"-statement); for instance, Gauss requires
that the sample mean be MLE for the location parameter for all sample
sizes simultaneously, Teicher (1961) only requires that it be MLE for samples
of sizes 2 and 3 at the same time, while Azzalini and Genton (2007) only need
that it be so for a single fixed sample size $n \ge 3$. Note that Azzalini and Genton
(2007) also construct explicit examples of non-Gaussian distributions for which
the sample mean is the MLE of the location parameter for n = 2. We already
draw the reader's attention to the fact that Azzalini and Genton (2007)'s result
does not supersede Teicher (1961)'s, since the former require more stringent 
conditions (namely differentiability of the g's) than the latter. We will comment 
upon this phenomenon again at the end of Section 4. 

Aside from these "technical" improvements, the literature on MLE charac- 
terization theorems also contains evolutions in different directions which have 
resulted in a new stream of research, namely that of discovering new (that is, 
different from Gauss' principle) MLE characterization theorems. These can be 
subdivided into two categories. On the one hand, MLE characterizations with 
respect to the location parameter but for other forms than the sample mean have 
been shown to hold for densities other than the Gaussian. On the other hand, 
MLE characterizations with respect to other parameters of interest than the location
parameter have also been considered. Teicher (1961) shows that if, under
some regularity assumptions and for all sample sizes, the MLE for the scale parameter
of a scale target distribution is the sample mean $\bar{x}$ then the target is the
exponential distribution, while if it corresponds to the square root of the sample
arithmetic mean of squares $(n^{-1}\sum_{i=1}^n x_i^2)^{1/2}$ then the target is the standard
normal distribution. Following up on Teicher's work, Kagan, Linnik and Rao
(1973) establish that the sample median is the MLE for the location parameter 
for all samples of size n = 4 if and only if the parent distribution is the Laplace 
law. Also, in Ghosh and Rao (1971), it is shown that there exist distributions 
other than the Laplace for which the sample median at n = 2 or n = 3 is MLE. 
Ferguson (1962) generalizes Teicher (1961)'s location-based MLE characteriza- 
tion from the Gaussian to a one-parameter generalized normal distribution, and 
Marshall and Olkin (1993) generalize Teicher (1961)'s scale-based MLE charac- 
terization of the exponential distribution to the gamma distribution with shape 
parameter $\alpha > 0$ by replacing $\bar{x}$ as MLE for the scale parameter with $\bar{x}/\alpha$.

Buczolich and Székely (1989) further investigate situations in which a weighted
average of ordered sample elements
can be an MLE of the location parameter. In parallel to all these "linear" MLE 
characterizations there have also been numerous developments regarding MLE 
characterizations for spherical distributions, that is distributions taking their 
values only on the unit hypersphere in higher dimensions, see Duerinckx and Ley 
(2012) and the references therein, or Section 8. Finally, there exists a further, 
different stream of MLE characterization research, inspired by Poincaré (1912).
This approach consists in relaxing the assumptions made on the role of the
parameter $\theta$ and choosing the MLE for this $\theta$ to be of a certain general form
(e.g., $n^{-1}\sum_{i=1}^n T(x_i)$ for some known function $T$). Then the class of distributions
which share this form of MLE for $\theta$ can be determined. We refer the reader to
Campbell (1970) and Bondesson (1997) for more explicit explanations.

1.2. Applications of MLE characterizations 

Perhaps the most remarkable application of MLE characterizations (and of 
Gauss' principle) is to be found in the origins of this field of study, and in 
how such results have led to the discovery of "natural" distributions, such as 
the Gaussian, the FvML (see below for a definition) or even the exponential 
family. Indeed, the Gaussian distribution was uncovered by Gauss through his 
effort of finding the location family for which the sample mean $\bar{x}$ is a most
probable value for $\theta$, the location parameter. Similarly, Poincaré in his "Calcul
des Probabilités" from 1912 derived a particular case of the exponential families
of distributions (see Lehmann and Casella (1998, Section 1.5)) by asking
for which distributions $\bar{x}$ is the MLE of $\theta$, without specifying the role of the
parameter. We refer to the historical remarks at the end of Bondesson (1997)
and the references therein for more information on both the works of Gauss and
Poincaré. Following Gauss' ideas, von Mises (1918) defined the circular (i.e.,
spherical in dimension two) analogue of the Gaussian distribution by looking 
for the circular distribution whose circular location parameter always has the 
circular sample mean as MLE; this led, inter alia, to the now famous Fisher-von 
Mises-Langevin (FvML) distribution on spheres. 

Of a totally different nature is Campbell (1970)'s use of MLE characteriza- 
tions. In his paper, Campbell establishes an equivalence between MLE char- 
acterizations in the spirit of Gauss' principle and the minimum discrimination 
information estimation of probabilities; to the best of our knowledge he is the 
only author to tackle continuous and discrete distributions in a single sweep. 
More recently, Puig (2008) has applied MLE characterizations in order to char- 
acterize the Harmonic Law as the only statistical model on the positive real 
half-line that satisfies a certain number of requirements. As a last example, 
we cite the work of Ley and Paindaveine (2010a) who have been able to solve a 
long-standing problem on skew-symmetric distributions by putting to use Gauss' 
principle. 

1.3. Purpose of the paper 

As is perhaps intuitively clear, the results from Teicher (1961); Ferguson (1962); 
Kagan, Linnik and Rao (1973); Marshall and Olkin (1993) or Azzalini and Genton 
(2007), to cite but these, stem from a common origin. As a matter of fact these 
authors all follow the same "smart path" that can be summarized in three steps:
(a) choose the role of the parameter of interest $\theta$ (location, scale, ...); (b) choose
a remarkable form for the MLE for $\theta$ (e.g., sample mean, variance or median);
(c) use the freedom of choice in the samples as well as the sample size (two sam- 
ples of respective sizes 2 and 3, one sample of size 3, all samples of all sizes, ...) 
to obtain the largest class of distributions satisfying certain assumptions (con- 
tinuous at a single point, continuous, differentiable, ...) which share this specific 



Duerinckx et al. /Maximum likelihood characterization 






MLE. While similar, the arguments leading to the different results nevertheless 
are largely ad hoc and rest upon crafty manipulations of the explicit given form
of the MLE. Moreover step (c) contains assumptions (on the minimal sample 
size and on the properties of the distributions being characterized) the necessity 
of which is barely addressed. 

The purpose of the present paper is to study the mechanism behind this 
"smart path", to show how all the above results are different instances of a 
common phenomenon originating in the properties of the score function of the 
distribution whose MLE is under consideration, and to explain the necessity of 
the assumptions appearing in the literature. More precisely, in this work not 
only will we identify minimal sufficient conditions under which densities are 
characterized by their MLEs but we will also introduce the concept of minimal 
covering sample size (which we abbreviate MCSS) which enlarges the notion of
minimal necessary sample size (MNSS) introduced in Duerinckx and Ley (2012) 
that provides the a priori smallest necessary sample size for an MLE character- 
ization to hold for a given family of distributions. We will provide a geometric 
interpretation to our MCSS and show how it, combined with the MNSS, explains 
many of the differences in the "always"-statements appearing in the literature.
A side-product of our unified perspective on MLE characterizations is that we 
hereby also provide tools for (easily) constructing new MLE characterizations 
of many important families of distributions. 

1.4. Outline of the paper

In Section 2 we describe the framework of our study and give all necessary 
notations. In Section 3 we establish and interpret the above-mentioned notion 
of MCSS which will be central to this paper. In Section 4, we derive the MLE 
characterization for univariate location families, while in Section 5 we proceed 
in a similar way with univariate scale families. In Section 6 we obtain MLE 
characterizations for the so-called group families, allowing us to study other 
roles of the parameter (e.g., skewness). In Section 7 we apply our findings to 
particular families of distributions. We conclude the paper with a discussion, in 
Section 8, of the different possible extensions that are left to be explored. 

2. Notations and generalities on ML estimators 

Throughout we consider observations $X^{(n)} = (X_1, X_2, \ldots, X_n)$ that are sampled
independently from a distribution $P_\theta^f$ (with density $f$) which we suppose to be
entirely known up to a parameter $\theta \in \Theta \subseteq \mathbb{R}$. The true parameter value $\theta_0 \in \Theta$ is
estimated by ML estimation on basis of $X^{(n)}$. As explained in the Introduction,
our aim consists in determining which classes of distributions are identifiable by
means of the MLE of the parameter of interest $\theta$, which can, in principle, be
of any nature (i.e., location, scale, etc.). On the target family of distributions
$\{P_\theta^f : \theta \in \Theta\}$ we make the following general assumptions:

- (A1) The parameter space $\Theta$ contains an open set $\Theta_0$ of which the true parameter
$\theta_0$ is an interior point.

- (A2) For each $\theta \in \Theta_0$, the distribution $P_\theta^f$ has support $S$ independent of $\theta$.

- (A3) For all $1 \le i \le n$ the random variable $X_i$ has a density $f(x_i; \theta)$ with
respect to the (dominating) Lebesgue measure.

- (A4) For $\theta \ne \theta' \in \Theta_0$, we have $P_\theta^f \ne P_{\theta'}^f$.

These assumptions are taken from (Lehmann and Casella, 1998, page 444). 

Remark 2.1. Typically we will be concerned with either location families with
densities of the form $f(x; \theta) = f(x - \theta)$ for $\theta \in \mathbb{R}$ the location parameter, or
scale families with densities of the form $f(x; \theta) = \theta f(x\theta)$ for $\theta \in \mathbb{R}_0^+$ the scale
parameter. It is clearly also possible to consider other roles for $\theta$ (skewness,
tail behavior, ...). Likewise, although we restrict our attention to the univariate
setting, our arguments are in some cases transposable word-by-word to the
multivariate case. Finally note that Assumption (A2) implies that only densities
with full support $\mathbb{R}$ may be considered for ML estimation of a location parameter,
while only densities with support either $\mathbb{R}$, $\mathbb{R}_0^+$ or $\mathbb{R}_0^-$ may be considered
for ML estimation of a scale parameter. All these restrictions are natural; they
nevertheless can, if deemed necessary, be lifted.

We define, for a fixed sample size $n \ge 1$, the MLE of the parameter $\theta$ as (if
it exists) the measurable function

$$\hat\theta_f^{(n)} : S^n := S \times \cdots \times S \to \Theta_0 : x^{(n)} := (x_1, \ldots, x_n) \mapsto \hat\theta_f^{(n)}(x^{(n)})$$

for which

$$\prod_{i=1}^n f(x_i; \hat\theta_f^{(n)}(x^{(n)})) \ge \prod_{i=1}^n f(x_i; \theta) \qquad (1)$$

for all $\theta \in \Theta_0$ and all samples $x^{(n)} \in S^n$ of size $n$. It is not trivial to provide
minimal conditions on $f$ under which $\hat\theta_f^{(n)}(x^{(n)})$ exists, is uniquely defined and
satisfies the necessary measurability conditions. Consistency of the MLE is also
a delicate matter and further regularity conditions are required for the problem
to make sense. As in Cramér (1946a, b) one may suppose that, for almost all $x$,
the density $f(x; \theta)$ is differentiable with respect to $\theta$. This allows one to define the
MLE as the solution of the local likelihood equation

$$\sum_{i=1}^n \varphi_f(x_i; \theta) = 0 \qquad (2)$$

where

$$\varphi_f(x; \theta) := \frac{\partial}{\partial\theta} \log f(x; \theta)$$

is the score function of the density $f$ associated with the parameter $\theta$ (we set
this function to 0 outside the support of $f$). The solution to (2) has, at least
asymptotically, the required properties (see Lehmann and Casella (1998, Theorem
6.3.7)). Furthermore, this way of proceeding allows for a simple sufficient
condition for uniqueness of the MLE: the mapping $x \mapsto \varphi_f(x; \theta)$ has to be
strictly monotone and to cross the x-axis. Note that this requirement coincides
with strong unimodality or log-concavity of the density $f$ when $\theta$ is a location
parameter (see Lehmann and Casella (1998, Exercise 6.3.15)).

Remark 2.2. Although in general there is no explicit expression for the MLE 
of a given parametric family, there exist several important distributions that not 
only satisfy all the above requirements but also allow for MLEs which take on 
a remarkable form. Taking $f = \phi$ the standard normal density, the MLE for
the location parameter is $\bar{x} := \frac{1}{n}\sum_{i=1}^n x_i$, the sample arithmetic mean, while
that for the scale parameter is $(\frac{1}{n}\sum_{i=1}^n x_i^2)^{1/2}$, the square root of the sample
arithmetic mean of squares. Taking $f$ the exponential density, the MLE for the
scale parameter becomes $\bar{x}$.
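
As a minimal numerical sketch of the first claim of this remark (an illustration only, not part of the argument), the following Python snippet solves the likelihood equation (2) by bisection for the Gaussian location score $\varphi_\phi(x; \theta) = x - \theta$ and compares the root with the sample mean. The function names, the sample and the bracketing interval are hypothetical choices made for the illustration.

```python
from typing import Callable, Sequence


def gaussian_location_score(x: float, theta: float) -> float:
    """Score d/dtheta log phi(x - theta) of the standard normal location family."""
    return x - theta


def solve_score_equation(
    score: Callable[[float, float], float],
    sample: Sequence[float],
    lo: float,
    hi: float,
    tol: float = 1e-12,
) -> float:
    """Bisection on theta -> sum_i score(x_i, theta).

    Assumes (illustrative hypothesis) that the summed score changes sign on [lo, hi].
    """
    def total(theta: float) -> float:
        return sum(score(x, theta) for x in sample)

    f_lo = total(lo)
    if f_lo * total(hi) > 0:
        raise ValueError("score sum does not change sign on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = total(mid)
        if f_lo * f_mid <= 0:
            hi = mid              # root lies in [lo, mid]
        else:
            lo, f_lo = mid, f_mid  # root lies in [mid, hi]
    return 0.5 * (lo + hi)


# Hypothetical sample; the summed score is >= 0 at min(sample) and <= 0 at max(sample).
sample = [0.3, -1.2, 2.5, 0.9]
theta_hat = solve_score_equation(gaussian_location_score, sample, min(sample), max(sample))
sample_mean = sum(sample) / len(sample)
```

Under these assumptions `theta_hat` and `sample_mean` agree up to the bisection tolerance; replacing `gaussian_location_score` by another strictly monotone location score gives the corresponding MLE, which in general has no closed form.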

One easily sees from both definitions (1) and (2) that, if $\hat\theta^{(n)}$ maximizes the
$f$-likelihood function, then it also maximizes the $g$-likelihood function for, e.g.,
$g = c f^d$ with $d > 0$ and $c$ a normalizing constant. More generally, any two
parametric densities $f(x; \theta)$ and $g(x; \theta)$ with same support $S$ and such that

$$\varphi_g(x; \theta) = d\,\varphi_f(x; \theta) \quad \forall x \in S \qquad (3)$$

for some $d > 0$ yield the same MLE. Thus, without further conditions or specifications
on the target density $f$ (that is, the density associated with the target
distribution $P_\theta^f$) and on $g$, MLE characterization theorems identify a class of
distributions rather than a single distribution. Considering for example Gauss' 
principle, both Teicher (1961) and Azzalini and Genton (2007), to cite but these, 
identify the Gaussian distribution with respect to its location parameter only 
up to an unknown variance. On the contrary, when dealing with scale charac- 
terizations, Teicher (1961) adds a further constraint in order to be able to single 
out the standard exponential and the standard Gaussian distribution. In what 
follows, we shall state our results in the most general possible way, without 
(at least in the main results) considering additional identification constraints. 
The trivial observation from above therefore leads us to partitioning the space 
of densities into parameter-dependent equivalence classes (e.c. hereafter) in the 
sense that two densities are equivalent if their score functions satisfy (3). As 
will become clear from the subsequent sections, the nature of the parameter of 
interest $\theta$ heavily influences this partition, so that, for each type of parameter,
one needs to first identify the e.c.'s (see the beginnings of Sections 4 and 5 for 
an illustration). 
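
The equivalence-class idea can be checked numerically in the location case. The sketch below is an illustration under stated assumptions: it takes $f$ to be the standard logistic density (our choice, not one made in the text), maximizes the location log-likelihood of $g = c f^d$ by ternary search for two values of $d$, and compares the maximizers. The normalizing constant $c$ is dropped because, for a location family, it does not depend on $\theta$. All function names are hypothetical.

```python
import math
from typing import Sequence


def log_logistic_density(x: float) -> float:
    """log f(x) for the standard logistic density f(x) = e^{-x} / (1 + e^{-x})^2."""
    # f is symmetric, so we may use |x|; this keeps exp() from overflowing.
    return -abs(x) - 2.0 * math.log1p(math.exp(-abs(x)))


def location_mle(sample: Sequence[float], power: float, tol: float = 1e-8) -> float:
    """Maximise theta -> sum_i power * log f(x_i - theta) by ternary search.

    This is the location log-likelihood of g = c * f**power up to the additive
    constant n * log(c). The logistic log-likelihood is strictly concave in theta
    and its maximiser lies between the smallest and largest observation.
    Accuracy is limited to roughly the square root of machine precision.
    """
    def loglik(theta: float) -> float:
        return sum(power * log_logistic_density(x - theta) for x in sample)

    lo, hi = min(sample), max(sample)
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if loglik(m1) < loglik(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


# Hypothetical sample and exponents.
sample = [-0.7, 0.1, 1.8, 4.2, 0.4]
mle_f = location_mle(sample, power=1.0)   # MLE under f
mle_g = location_mle(sample, power=3.5)   # MLE under g = c * f**3.5
```

Both maximizers coincide up to numerical error, as (3) predicts with $d = 3.5$; the check says nothing about parameters other than location, for which the e.c.'s take a different form.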

The framework we have developed so far allows us to reformulate the question
underpinning the present article in an arguably more transparent form, namely
"Do there exist two distinct e.c.'s $\mathcal{F}(\theta)$ and $\mathcal{G}(\theta)$ such that the distributions $P_\theta^f$
and $P_\theta^g$ for $f \in \mathcal{F}(\theta)$ and $g \in \mathcal{G}(\theta)$ share a given MLE of the parameter of
interest $\theta$?" In other words, let $\mathcal{F}(\theta)$ be the target e.c. we want to characterize,
and let $\mathcal{G}(\theta)$ be another e.c. whose parametric densities have the same support $S$.
Let $f \in \mathcal{F}(\theta)$ and $g \in \mathcal{G}(\theta)$. Suppose that, for some $n \ge 1$, the estimator $\hat\theta_g^{(n)}$,
the MLE of $\theta$ under $P_\theta^g$, coincides with $\hat\theta_f^{(n)}$ for all samples of size $n$, i.e.

$$\sum_{i=1}^n \log g(x_i; \hat\theta_f^{(n)}(x^{(n)})) \ge \sum_{i=1}^n \log g(x_i; \theta) \qquad (4)$$

for all $\theta \in \Theta_0$ and all samples $x^{(n)} \in S^n$ of size $n$. Intuitively it seems clear
that, if $n$ is large enough, the only way for $f$ and $g$ to satisfy (4) for all samples
is that $\varphi_g$ and $\varphi_f$ be close to one another, hence that the two e.c.'s are in fact
the same. This intuition is, as we shall see, correct and is the heart of all the
literature on MLE characterizations.



3. The minimal covering sample size 

The first step towards establishing our general MLE characterizations is to gain 
a better understanding of the meaning of "sufficiently large n" for (4) to induce 
a characterization theorem. We use the notation $H_n$ to denote the hyperplane

$$H_n = \Big\{ b^{(n)} = (b_1, \ldots, b_n) \in \mathbb{R}^n \;\Big|\; \sum_{i=1}^n b_i = 0 \Big\}$$

and associate, with any parametric distribution $P_\theta^f$ satisfying the requirements
of Section 2, the collection(s) of sets

$$B_n^{f,\theta}(A) = \{ b^{(n)} \in H_n \mid \exists\, x^{(n)} \in A^n \text{ with } b_j = \varphi_f(x_j; \theta) \text{ for all } 1 \le j \le n \} \qquad (5)$$

where $A^n = A \times \cdots \times A$ is the $n$-fold cartesian product of $A \subseteq S$, the support
of the target $f$, and $\theta \in \Theta_0$. The interplay between the sets $B_n^{f,\theta}(A)$ and the
hyperplanes $H_n$ determines the minimal sample size $n$ for a characterization
theorem to hold. The following lemma is crucial to our approach.

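Lemma 3.1. Suppose that for some $\theta \in \Theta_0$ the mapping $x \mapsto \varphi_f(x; \theta)$ is
strictly monotone over some interval $\mathcal{X} \subseteq S$ and that the (restricted) image
$I_{f,\theta}(\mathcal{X}) := \{\varphi_f(x; \theta) \mid x \in \mathcal{X}\}$ is of the form $(-P^-_{f,\theta}, P^+_{f,\theta})$ for positive constants
$P^-_{f,\theta}, P^+_{f,\theta}$ (possibly infinite). Then, for all $n \ge 1$, $B_n^{f,\theta}(\mathcal{X}) = H_n \cap (I_{f,\theta}(\mathcal{X}))^n$.
Also, letting

$$N_{f,\theta} = \begin{cases} 2 & \text{if } P^-_{f,\theta} = P^+_{f,\theta}, \\ \left\lceil \dfrac{\max(P^+_{f,\theta}, P^-_{f,\theta})}{\min(P^+_{f,\theta}, P^-_{f,\theta})} \right\rceil + 1 & \text{if } P^-_{f,\theta} \ne P^+_{f,\theta} \text{ and } P^-_{f,\theta}, P^+_{f,\theta} < +\infty, \\ +\infty & \text{otherwise,} \end{cases}$$

we get

• for $n < N_{f,\theta}$, the orthogonal projections $\Pi_{x_j}(B_n^{f,\theta}(\mathcal{X})) \subsetneq (-P^-_{f,\theta}, P^+_{f,\theta})$ for
all $j = 1, \ldots, n$;

• for all $n \ge N_{f,\theta}$, $\Pi_{x_j}(B_n^{f,\theta}(\mathcal{X})) = (-P^-_{f,\theta}, P^+_{f,\theta})$ for all $j = 1, \ldots, n$.

Fig 1. The sets $B_2^{f,\theta}(\mathbb{R})$ when $I_{f,\theta} = [-1, 1]$ (left plot) and $I_{f,\theta} = [-1, 3]$ (right plot). The
red lines indicate the marginal projections.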

The number $N_{f,\theta}$ is called the Minimal Covering Sample Size (MCSS).

The MCSS, greater than or equal to 2, is the smallest possible value of the
sample size $n$ that ensures that all the projections of $B_n^{f,\theta}(\mathcal{X})$ onto the distinct
subspaces generated by each observation $x_j$, $j = 1, \ldots, n$, cover entirely $I_{f,\theta}(\mathcal{X})$.
For ease of reference we say, in that case, that $B_n^{f,\theta}(\mathcal{X})$ is projectable.

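Example 3.1. Take $f(x; \theta) = \phi(x - \theta)$ the standard Gaussian density with
$\theta \in \mathbb{R}$ a location parameter. We have $S = \mathbb{R}$, $\varphi_\phi(x; \theta) = x - \theta$ (and $\hat\theta_\phi^{(n)} = \bar{x}$,
the sample mean). Then $\varphi_\phi(x; \theta)$ is invertible over $\mathbb{R}$, $I_{\phi,\theta}(\mathbb{R}) = \mathbb{R}$ (for all $\theta$)
and easy computations show that $B_n^{\phi,\theta}(\mathbb{R}) = H_n$ for all $n$. Note that we always
have $\Pi_{x_j}(B_n^{\phi,\theta}(\mathbb{R})) = \mathbb{R}$; in other words, $B_n^{\phi,\theta}(\mathbb{R})$ is projectable for all $n \ge 2$
(and all $\theta$) and hence MCSS = 2.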

Example 3.2. Take $\varphi_f(\cdot; \theta)$ to be monotone on $S = \mathbb{R}$ with symmetric image
$(-1, 1)$, say, independently of $\theta$. Then clearly $B_2^{f,\theta}(\mathbb{R})$ is the intersection between
the line $H_2 = \{x + y = 0\}$ (see Figure 1) and the square $(-1, 1)^2$, while $B_3^{f,\theta}(\mathbb{R})$ is
the intersection between the plane $H_3 = \{x + y + z = 0\}$ and the cube $(-1, 1)^3$.
Consequently the coordinates of points on $B_2^{f,\theta}(\mathbb{R})$ and $B_3^{f,\theta}(\mathbb{R})$ cover the full
image $(-1, 1)$. Hence MCSS = 2.

Next suppose that $\varphi_f(\cdot; \theta)$ has a skewed image $(-1, 3)$, say, still independently
of $\theta$. Then, while $B_2^{f,\theta}(\mathbb{R})$ and $B_3^{f,\theta}(\mathbb{R})$ remain defined as above (with $(-1, 3)$
replacing $(-1, 1)$), coordinates of points in these domains do not cover the full
image. In fact, in $B_2^{f,\theta}(\mathbb{R})$ these coordinates only cover the interval $(-1, 1)$ (see
Figure 1), in $B_3^{f,\theta}(\mathbb{R})$ these coordinates only cover the interval $(-1, 2)$ and it is
only from $n \ge 4$ onwards that the coordinates of points in $B_n^{f,\theta}(\mathbb{R})$ cover the full
image. Hence MCSS = 4.

Example 3.3. Take $f(x; \theta) = \theta\phi(\theta x)$ the Gaussian density with $\theta \in \mathbb{R}_0^+$ a scale
parameter. We have $S = \mathbb{R}$, $\varphi_\phi(x; \theta) = \frac{1}{\theta}(1 - \theta^2 x^2)$. Then $\varphi_\phi(\cdot; \theta)$ is invertible
over $\mathbb{R}_0^+$ and $\mathbb{R}_0^-$, separately, and $I_{\phi,\theta}(\mathbb{R}_0^\pm) = (-\infty, 1/\theta)$. Note that we have
$\Pi_{x_j}(B_n^{\phi,\theta}(\mathbb{R}_0^\pm)) = (-(n - 1)/\theta, 1/\theta)$ for all $j$, all $\theta$ and all $n \ge 2$. In other
words, $B_n^{\phi,\theta}(\mathbb{R}_0^\pm)$ is only asymptotically projectable. Hence MCSS = $+\infty$.

Proof of Lemma 3.1. The assumptions on the image of $\varphi_f(\cdot; \theta)$ guarantee the
existence of a point in $\mathcal{X}$ where $\varphi_f$ crosses the x-axis so that $B_1^{f,\theta}(\mathcal{X}) = \{0\}$.
Also, for all $n \ge 1$, we have $B_n^{f,\theta}(\mathcal{X}) = H_n \cap (I_{f,\theta}(\mathcal{X}))^n$; this follows by definition.
Regarding the MCSS, first take $P^-_{f,\theta} = P^+_{f,\theta} = P$ (possibly infinite). Then, for
all $n \ge 2$, $H_n \cap (-P, P)^n$ contains for each of the $n$ coordinates the full interval
$(-P, P)$ (see Figure 1); hence MCSS = 2. Next suppose that $P^-_{f,\theta} < P^+_{f,\theta} < \infty$
and consider a "worst-case scenario" by taking a point at the extreme of
$H_n \cap (I_{f,\theta}(\mathcal{X}))^n$, with one coordinate set to $b_1 = P^+_{f,\theta} - \epsilon$ for some $\epsilon > 0$. Then,
in order to construct a sample satisfying $\sum_{i=1}^n b_i = 0$ it is necessary to choose
the remaining $(n-1)$-tuple $(b_2, \ldots, b_n)$ so as to satisfy $\sum_{i=2}^n b_i = -b_1$. Since the
best choice in this respect consists in setting all $b_i$ near the other extremum
$-P^-_{f,\theta} + \epsilon'$ for $\epsilon' > 0$, we see that, depending on the magnitude of the ratio
$P^+_{f,\theta}/P^-_{f,\theta}$, a given sample size $n$ may not be large enough for the equality to
hold. In order to palliate this it suffices to take $N_{f,\theta}$ to be the smallest natural
number such that $P^+_{f,\theta} - (N_{f,\theta} - 1)P^-_{f,\theta} \le 0$, that is $N_{f,\theta} = \lceil P^+_{f,\theta}/P^-_{f,\theta} \rceil + 1$.
The case $P^+_{f,\theta} < P^-_{f,\theta} < \infty$ follows along the same lines, and hence

$$\text{MCSS} = \left\lceil \frac{\max(P^+_{f,\theta}, P^-_{f,\theta})}{\min(P^+_{f,\theta}, P^-_{f,\theta})} \right\rceil + 1.$$

The same argumentation applies in the case where either one of $P^+_{f,\theta}$ or $P^-_{f,\theta}$ is
infinite, this time with MCSS = $+\infty$. This concludes the proof. □
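
The formula of Lemma 3.1 and the projections quoted in Examples 3.2 and 3.3 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper names `mcss` and `projection` are ours, and `projection` uses the elementary observation that, on $H_n$, a coordinate $b_1$ equals minus the sum of the $n - 1$ other coordinates, each of which lies in $(-P^-, P^+)$.

```python
import math
from typing import Tuple


def mcss(p_minus: float, p_plus: float) -> float:
    """Minimal covering sample size for a score image (-p_minus, p_plus)."""
    if p_minus <= 0 or p_plus <= 0:
        raise ValueError("both bounds must be positive")
    if p_minus == p_plus:                      # symmetric image, possibly infinite
        return 2
    if math.isinf(p_minus) or math.isinf(p_plus):
        return math.inf                        # only asymptotically projectable
    return math.ceil(max(p_minus, p_plus) / min(p_minus, p_plus)) + 1


def projection(p_minus: float, p_plus: float, n: int) -> Tuple[float, float]:
    """End points of the open interval covered by one coordinate of
    H_n intersected with (-p_minus, p_plus)^n, for n >= 2.

    b_1 must lie in the image and also in (-(n-1) * p_plus, (n-1) * p_minus),
    the range of minus the sum of the other n - 1 coordinates.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    lower = max(-p_minus, -(n - 1) * p_plus)
    upper = min(p_plus, (n - 1) * p_minus)
    return lower, upper


# Skewed image (-1, 3) of Example 3.2: projections for n = 2, 3, 4.
example_3_2 = [projection(1.0, 3.0, n) for n in (2, 3, 4)]
# -> [(-1.0, 1.0), (-1.0, 2.0), (-1.0, 3.0)]

# Gaussian scale image (-inf, 1/theta) of Example 3.3, here with theta = 2 and n = 3.
example_3_3 = projection(math.inf, 0.5, 3)
# -> (-1.0, 0.5), i.e. (-(n - 1)/theta, 1/theta)
```

For the skewed image the projection first equals the full image at $n = 4$, which is the value returned by `mcss(1.0, 3.0)`.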

Let us now look at the connection between the MCSS and MLE characterizations.
Let $f$ and $g$ be two representatives of distinct e.c.'s with $f$ the target
density. Under the assumption of $\theta$-differentiability of $g$, the defining equation
(4) can be re-expressed as

$$\sum_{i=1}^n \varphi_g(x_i; \hat\theta_f^{(n)}(x^{(n)})) = 0 \text{ for all } x^{(n)} \in S^n,$$

which in turn can be rewritten as

$$\sum_{i=1}^n \varphi_g(x_i; \theta) = 0 \text{ for all } \theta \text{ and all } x^{(n)} \text{ such that } \sum_{i=1}^n \varphi_f(x_i; \theta) = 0, \qquad (7)$$

(here $\theta$ and $x^{(n)}$ are interdependent) or, equivalently,

$$\sum_{i=1}^n h(y_i; \theta) = 0 \text{ for all } \theta \text{ and all } y^{(n)} \in B_n^{f,\theta}(S), \qquad (8)$$

with $h(y; \theta) = \varphi_g(\varphi_f^{-1}(y; \theta); \theta)$ (recall $\varphi_f$ is supposed to be invertible at any
$\theta$). Equation (8) completely identifies the function $h$ (hence $g$ in terms of $f$),
at least when $B_n^{f,\theta}(S)$ is sufficiently rich. This richness depends strongly on the
MCSS introduced in Lemma 3.1. Indeed supposing that (8) is only valid for a
sample size smaller than the MCSS implies that portions of the images $I_{f,\theta}(S)$
cannot be reached, so that $h$ cannot be identified over its entire support. This
necessarily implies that MNSS $\ge$ MCSS. It is, however, pointless to try to solve
(8) in all generality and it is now necessary to specify the role of $\theta$ in order to
proceed. We will do so in detail in the next sections, first in the case of location
parameters; our arguments will afterwards adapt directly to other parameter
choices.

4. MLE characterization for location parameter families 

Let us start by determining the e.c.'s for $\theta$ a location parameter. In such a
case, $\Theta_0 = S = \mathbb{R}$, and the location score functions are of the form $\varphi_f(x; \theta) =
\varphi_f(x - \theta) = -f'(x - \theta)/f(x - \theta)$ over $\mathbb{R}$, so that equation (3) turns into a
simple first-order differential equation whose solution yields $g(x) = c(f(x))^d$ for
some $d > 0$ and $c$ the normalizing constant. Thus, all densities which are linked
one to another via that relationship belong to a same e.c. We here draw the
reader's attention to the fact that, for $f = \phi$ the standard Gaussian density, such
transformations reduce to a non-specification of the variance, which is clearly in 
line with Gauss' principle as stated by Teicher (1961) or Azzalini and Genton 
(2007). 

Our first main theorem is, in essence, a generalization of Gauss' principle 
from the Gaussian distribution to the entire class of log-concave distributions 
with continuous score function. 

Theorem 4.1. Let $\mathcal{F}(\mathrm{loc})$ and $\mathcal{G}(\mathrm{loc})$ be two distinct location-based e.c.'s and
let their respective representatives $f$ and $g$ be two continuously differentiable
densities with full support $\mathbb{R}$. Let $x \mapsto \varphi_f(x) = -f'(x)/f(x)$ be the location
score function of $f$. If $\varphi_f$ is invertible over $\mathbb{R}$ and crosses the x-axis then there
exists $N \in \mathbb{N}$ such that, for any $n \ge N$, we have $\hat\theta_f^{(n)} = \hat\theta_g^{(n)}$ for all samples of
size $n$ if and only if there exist constants $c, d \in \mathbb{R}_0^+$ such that $g(x) = c(f(x))^d$
for all $x \in \mathbb{R}$, that is, if and only if $\mathcal{F}(\mathrm{loc}) = \mathcal{G}(\mathrm{loc})$. The smallest integer for
which this holds (the Minimal Necessary Sample Size) is MNSS = $\max\{N_f, 3\}$,
with $N_f$ the MCSS as defined in Lemma 3.1.

Proof. The sufficient condition is trivial. To prove necessity first note how our
assumptions on $f$ ensure that the score function $\varphi_f$ is strictly monotone on
the whole real line $\mathbb{R}$ and has a unique root. This allows us to write the image
$\mathrm{Im}(\varphi_f)$ as $(-P_f^-, P_f^+)$ with $0 < P_f^-, P_f^+ \le \infty$. The differentiability of $g$ and the
nature of the parameter $\theta$ permit us to rewrite, for any admissible $\theta$, (7) as

$$\sum_{i=1}^n \varphi_g(x_i - \theta) = 0 \text{ for all } x^{(n)} \in \mathbb{R}^n \text{ such that } \sum_{i=1}^n \varphi_f(x_i - \theta) = 0, \qquad (9)$$

where $\varphi_g(x) = -g'(x)/g(x)$. Using the strict monotonicity of $\varphi_f$ one then concludes
that (9) is equivalent to requiring that $g$ satisfies

$$\sum_{i=1}^n h(b_i) = 0 \text{ for all } (b_1, \ldots, b_n) \in B_n^{f,\theta}(\mathbb{R}) \qquad (10)$$

where $h = \varphi_g \circ \varphi_f^{-1}$ and $b_i = \varphi_f(x_i - \theta)$, $i = 1, \ldots, n$, as in (5). In what follows,
we shall use our liberty of choice among all $n$-tuples $b^{(n)} \in B_n^{f,\theta}(\mathbb{R})$ in order to
gain sufficient information on $h$ to conclude.

First suppose that $P_f^- = P_f^+ = P$, hence that the image of $\varphi_f$ is symmetric.
We know from Lemma 3.1 that the corresponding MCSS equals 2, hence that
two observations suffice to make $B_n^{f}(\mathbb{R})$ projectable. Therefore, for any $n \ge 2$,
we can always build an $n$-tuple $b_1, \ldots, b_n$ such that $b_2 = -b_1$ for all $b_1 \in (-P, P)$
and $b_i = 0$ for $i = 3, \ldots, n$. From (10) we then deduce that $h$ satisfies the equality
$h(-a) = -h(a)$ for all $a \in (-P, P)$, hence that $h$ is odd on $(-P, P)$. Evidently
this leaves $h$ undetermined, hence the MNSS must at least equal 3. For $n \ge 3$,
choose an $n$-tuple such that $b_3 = -b_1 - b_2$ and $b_i = 0$ for $i = 4, \ldots, n$, for
$b_1, b_2 \in (-P, P)$ such that $b_1 + b_2 \in (-P, P)$. Using (10) combined with the
antisymmetry of $h$ we deduce that this function must satisfy

$$h(b) + h(c) = h(b + c) \tag{11}$$

for all $b, c \in (-P, P)$ such that $b + c \in (-P, P)$. One recognizes in (11) a
(restricted) form of the celebrated Cauchy functional equation. Assume that $P < \infty$;
then $h(P/2)$, say, is finite and standard arguments (see, e.g., Aczel and Dhombres,
1989) imply that our solution $h$ satisfies $h(uP/2) = u\,h(P/2)$ for all $u \in (-2, 2)$
and we conclude that $h(x) = dx$ for all $x \in (-P, P)$, with $d = h(P/2)/(P/2) \in \mathbb{R}$.
Considering $x = \varphi_f(y)$ for $y \in \varphi_f^{-1}(-P, P) = \mathbb{R}$, we obtain that
$\varphi_g(y) = d\,\varphi_f(y)$. Solving this first-order differential equation gives $g(y) = c\,(f(y))^d$ for
all $y \in \mathbb{R}$, with $c$ a constant. In order for the function $g$ to be integrable over
$\mathbb{R}$, the constant $d$ must be strictly positive; in order for $g$ to be positive and
integrate to 1, the constant $c$ must be a normalizing constant. Thus, for $P < \infty$,
the problem is solved. For $P = \infty$, the situation becomes even simpler as (11) is
then precisely the Cauchy functional equation, and one may immediately draw
the same conclusion as for finite $P$.
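
The easy direction of this conclusion can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original argument: it takes the Gumbel density as target $f$, a hypothetical exponent $d = 2.5$, and a hypothetical bisection helper `location_mle`. Since the score of $g = c\,f^d$ is $d\,\varphi_f$, both likelihood equations should have the same root.

```python
import math

def gumbel_score(x):
    # Location score phi_f(x) = -f'(x)/f(x) of f(x) = exp(-x - exp(-x)).
    return 1.0 - math.exp(-x)

def location_mle(score, xs, lo=-50.0, hi=50.0, tol=1e-12):
    # Bisection on theta -> sum_i score(x_i - theta); this map is decreasing
    # in theta because the score is increasing.
    def total(theta):
        return sum(score(x - theta) for x in xs)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

d = 2.5  # hypothetical exponent in g = c * f**d (assumption for illustration)

def g_score(x):
    # Score of g = c * f**d: the constant c drops out and phi_g = d * phi_f.
    return d * gumbel_score(x)

sample = [0.3, -1.2, 2.0, 0.7]  # arbitrary illustrative sample
theta_f = location_mle(gumbel_score, sample)
theta_g = location_mle(g_score, sample)
```

Here `theta_f` and `theta_g` agree up to the bisection tolerance, as expected from $\varphi_g = d\,\varphi_f$; the converse direction is of course the substance of the proof and is not something a numerical check can establish.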

Let us now consider the case where $\varphi_f$ has a skewed image and set
$P = \min(P_f^-, P_f^+)$ (note that $P$ is necessarily finite as otherwise $\mathrm{Im}(\varphi_f)$ would be
symmetric). First restricting our attention to $(-P, P)$, we can repeat the above
arguments to deduce that $g(y) = c\,(f(y))^d$ for all $y \in \varphi_f^{-1}(-P, P) \subsetneq \mathbb{R}$. We thus
further need to investigate the behavior of $h$ on the remaining part of $\mathrm{Im}(\varphi_f)$
which, for the sake of simplicity, we denote as $\mathrm{Out}(P)$ (it is either $(-P_f^-, -P)$
or $(P, P_f^+)$). To this end, we precisely need to know the MCSS and hence call
upon Lemma 3.1. Fixing $n \ge N_f$ and taking a sample $(b_1, \ldots, b_n)$ such that
$\sum_{i=1}^n b_i = 0$ with $b_1 \in \mathrm{Out}(P)$ and $(b_2, \ldots, b_n) \in (-P, P)^{n-1}$, we can apply
(10) to get $h(b_1) + \sum_{i=2}^n h(b_i) = 0$; hence, from our knowledge about the
behavior of $h$ on $(-P, P)$, we deduce that

$$h(b_1) = -\sum_{i=2}^n b_i\, h(P)/P = b_1\, h(P)/P,$$

since $h(P)$ is necessarily finite. Consequently we get $h(y) = y\,h(P)/P$ for all
$y \in (-P, P) \cup \mathrm{Out}(P) = \mathrm{Im}(\varphi_f)$ and $g(y) = c\,(f(y))^d$ for all $y \in \mathbb{R}$, and the
conclusion follows.

The proof of the theorem is nearly complete: all that remains is to show that
the MNSS $= \max\{3, N_f\}$ is minimal and sufficient. The latter is immediate since
if the result holds true for any sample of size $N = \mathrm{MNSS}$ then, for any larger
sample size $M > \mathrm{MNSS}$, one can always consider $x^{(M)}$ such that
$\varphi_f(x^{(M)}) = b^{(M)} \in B_M^{f}(\mathbb{R})$ is of the form $(b_1, \ldots, b_N, 0, \ldots, 0)$ and
$(b_1, \ldots, b_N) \in B_N^{f}(\mathbb{R})$,
and work as above to characterize the density. To prove the minimality of the
MNSS it suffices to exhibit specific counter-examples. This is done in Examples
4.1 and 4.2 below. □

Example 4.1. To see that $N = 3$ is minimal when $\mathrm{Im}(\varphi_f)$ is symmetric, we
need to construct two distributions $g_1, g_2$ which share $f$'s MLE for all samples
of size 2. Construct $g_1$ as in the proof of Theorem 4.1. To construct $g_2$ it
suffices to replace the function $h$ from (10) with any odd function and to solve the
resulting equation in $g$ (while ensuring integrability of $g$). If, for example, we
choose $h(x) = d x^3$, then we readily obtain $g(y) = c \exp\left(-d \int^{y} (\varphi_f(u))^3\, du\right)$;
this is however not a density for all $f$, though a good choice for $f = \phi$ the
Gaussian (for which $\varphi_\phi(x) = x$). Another way of proceeding is to work as in
Azzalini and Genton (2007) and to choose $h(y) = y + w'(y)$ for some
differentiable even function $w$.

Example 4.2. Suppose that $\hat\theta_f^{(n)} = \hat\theta_g^{(n)}$ for all samples of size $n$ for some
$n < N_f$ when $N_f > 3$. Then, as is clear from Lemma 3.1, the whole domain
(that is, $\mathbb{R}$) of $f$ is not identified by our technique and it suffices to choose any
density which is equal to $f$ on the maximal identifiable subdomain but differs
elsewhere. Expressed in terms of $h$ for the case $P_f^- < P_f^+$, we can only identify
$h$ on some interval $(-P_f^-, P_f^- + a(n))$, say, with $0 \le a(n) < P_f^+ - P_f^-$. On
the remaining part $[P_f^- + a(n), P_f^+)$, $h$ is undetermined and hence can take
any possible form, implying that the relationship between $g$ and $f$ can only be
established on the part $\varphi_f^{-1}(-P_f^-, P_f^- + a(n)) \subset \mathbb{R}$.

As in Azzalini and Genton (2007), it is sufficient to require in Theorem 4.1
that $g$ be continuously differentiable at a single point for everything to run
smoothly. Pursuing in this vein, it is of course natural to enquire whether the
result still holds if no such regularity assumption is imposed on $g$, i.e. if we only
suppose that the target density $f$ is differentiable but $g$ is a priori not. Put
simply, the question becomes that of enquiring whether the condition



$$\prod_{i=1}^n g\big(x_i - \hat\theta_f^{(n)}\big) \ge \prod_{i=1}^n g(x_i - \theta) \tag{12}$$

for all $x^{(n)} \in \mathbb{R}^n$ and all $\theta \in \mathbb{R}$ suffices to determine $g$. This is the approach
adopted, e.g., in Teicher (1961); Kagan, Linnik and Rao (1973) or Marshall and Olkin
(1993), where it is shown that having the likelihood condition (12) with $\hat\theta_f^{(n)}$ the
sample mean implies $g$ is the Gaussian as soon as the result holds for all samples
of sizes 2 and 3 simultaneously. Interestingly, in our framework, this arguably
more general assumption on $g$ comes with a cost: our method of proof then
necessitates imposing more restrictive assumptions on $f$ and requiring the
likelihood equations to hold for two sample sizes simultaneously.
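
The role played by sample sizes 2 and 3 can be illustrated with the counter-example of Example 4.1. The sketch below assumes a Gaussian target and the cubic choice $h(x) = d x^3$, so that $g(y) \propto \exp(-d y^4/4)$ and the likelihood equation of $g$ reads $\sum_i (x_i - \theta)^3 = 0$; the helper names are hypothetical. For two observations the root is the sample mean, whereas for three observations it generally is not.

```python
def quartic_location_mle(xs, tol=1e-12):
    # Root in theta of sum_i (x_i - theta)**3 = 0, the likelihood equation of
    # g(y) proportional to exp(-d * y**4 / 4); the root does not depend on d > 0.
    lo, hi = min(xs), max(xs)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # The sum is decreasing in theta, so a positive value puts the root to the right.
        if sum((x - mid) ** 3 for x in xs) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sample_mean(xs):
    # Gaussian location MLE.
    return sum(xs) / len(xs)

pair = [-1.0, 4.0]
triple = [0.0, 0.0, 3.0]
mle_pair = quartic_location_mle(pair)      # coincides with the mean of the pair
mle_triple = quartic_location_mle(triple)  # differs from the mean of the triple
```

For the triple $(0, 0, 3)$ the equation $-2\theta^3 + (3 - \theta)^3 = 0$ has root $3/(1 + 2^{1/3})$, which is not the sample mean 1; this is the sense in which two observations alone cannot separate $g$ from the Gaussian.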

Theorem 4.2. Let $\mathcal{F}(\mathrm{loc})$ and $\mathcal{G}(\mathrm{loc})$ be two distinct location-based e.c.'s and
let their respective representatives $f$ and $g$ be two continuous densities with full
support $\mathbb{R}$. Suppose that $f$ is symmetric and continuously differentiable, and
assume that its location score function $\varphi_f(x)$ is invertible over $\mathbb{R}$ and crosses
the x-axis. Then we have $\hat\theta_f^{(n)} = \hat\theta_g^{(n)}$ for all samples of sizes 2 and $n \ge 3$
simultaneously if and only if there exist constants $c, d \in \mathbb{R}_0^+$ such that
$g(x) = c\,(f(x))^d$ for all $x \in \mathbb{R}$, that is, if and only if $\mathcal{F}(\mathrm{loc}) = \mathcal{G}(\mathrm{loc})$.

Proof. Our proof, which extends that of Teicher (1961) from the Gaussian case
to the entire class of symmetric log-concave densities $f$, proceeds in two main
steps: we first show that our assumptions on $g$ in fact entail that $g$ is
continuously differentiable, and then conclude by applying Theorem 4.1. The additional
sample size $n = 2$ needed here stems from the first step.
Condition (12) can be rewritten as

$$\sum_{i=1}^n \log g(y_i) \ge \sum_{i=1}^n \log g(y_i - \theta)$$

for all $\theta \in \mathbb{R}$ and $y_1, \ldots, y_n$ satisfying $\sum_{i=1}^n \varphi_f(y_i) = 0$. The latter expression in
turn is equivalent to

$$\sum_{i=1}^{n-1} \log g(y_i) + \log g\Big(\varphi_f^{-1}\Big(-\sum_{j=1}^{n-1} \varphi_f(y_j)\Big)\Big) \ge \sum_{i=1}^{n-1} \log g(y_i - \theta) + \log g\Big(\varphi_f^{-1}\Big(-\sum_{j=1}^{n-1} \varphi_f(y_j)\Big) - \theta\Big). \tag{13}$$

Arguing as in Teicher (1961), it is sensible to confine our attention at first to
symmetric densities $g$. Using the assumed symmetric nature of $f$ (and hence the
oddness of $\varphi_f^{-1}$), considering the sample size $n = 2$ and setting observation $y_1$
equal to some $y \in \mathbb{R}$, (13) simplifies into

$$2 \log g(y) \ge \log g(y - \theta) + \log g(y + \theta) \tag{14}$$

for all $\theta \in \mathbb{R}$. Since $\log g$ is everywhere finite, concave according to (14), and
inherits measurability from $g$, it is an a.e.-continuously differentiable function.
Arrived at this point, we may apply Theorem 4.1 to conclude.
Finally, for non-necessarily symmetric densities we can follow exactly the
argumentation from Teicher (1961) and derive that the previously obtained
solution is the only one, hence the claim holds. □

We stress the fact that, as in Teicher (1961), we may further weaken our 
assumptions on g by only requiring that it is lower semi-continuous at the origin 
and need not have full support R. Indeed, as shown in Teicher's proof, continuity 
and a.e.-positivity ensue from the above arguments. 

One may wonder whether the symmetry assumption on the target density $f$ is
necessary or whether this second general location MLE characterization theorem
may in fact hold for the entire class of log-concave densities as well. Our method
of proof indeed requires this assumption so as to enable us to deal with such
quantities as $\varphi_f^{-1}\big(-\sum_{j=1}^{n-1} \varphi_f(y_j)\big)$ in (13); without any assumption on $f$, for
$n = 2$, this expression does not simplify into the agreeable form $-y_1$. Similarly
one may wonder whether it is necessary to suppose the result to hold for two
sample sizes simultaneously or whether one single sample size $n \ge 3$ might not
suffice. We leave as open problems the question whether these assumptions are
necessary or simply sufficient.

Finally our Theorems 4.1 and 4.2 do not cover target densities whose location
score function is monotone but not invertible over the entire real line, that is,
piecewise constant. We do not consider explicitly such setups here since they do
not assure that the MLE is defined in a unique way. The strategy we nevertheless
suggest consists in applying our results on the monotonicity intervals, to draw
the necessary conclusions and express $g$ in terms of $f$ on those intervals. If we
add the condition of monotonicity of $\varphi_g$, the equality $\varphi_g(x) = d\,\varphi_f(x)$ has to hold
over the entire support $\mathbb{R}$ as monotonicity imposes $\varphi_g$ to be constant outside
the above-mentioned intervals. Since we here do not implicitly use the intervals
where $\varphi_f$ is constant, there might exist better strategies, and consequently the
smallest possible sample size we obtain by following this scheme is an upper
bound for the true MNSS. The most extreme situation takes place when the
target is a Laplace distribution, in which case $\varphi_f(x) = \mathrm{sign}(x)$; we refer to
Kagan, Linnik and Rao (1973) for a treatment of this particular distribution.
We will return to these matters briefly in Section 8.

5. MLE characterization for scale parameter families 

As for location parameter families, we start by fixing the e.c.'s when $\theta$ plays
the role of a scale parameter. In such a setup, $\Theta_0 = \mathbb{R}_0^+$, $S = \mathbb{R}$, $\mathbb{R}_0^+$ or $\mathbb{R}_0^-$
in view of Assumption (A2), and the scale score functions are of the form
$\varphi_f(x; \theta) = \frac{1}{\theta}\psi_f(\theta x) := \frac{1}{\theta}\big(1 + \theta x f'(\theta x)/f(\theta x)\big)$ over $S$, so that equation (3)
turns into another quite simple first-order differential equation whose solution
leads to $g(x) = c\,|x|^{d-1}(f(x))^d$ for some $d > 0$ (such that $g$ is integrable) and $c$
the normalizing constant. This relationship defines the scale-based e.c.'s. It is to
be noted that $c = d = 1$ when the origin belongs to the support, that is, when
$S = \mathbb{R}$, in which case the e.c.'s reduce to singletons $\{f\}$.
Our main scale MLE characterization theorem is the exact equivalent of
Theorem 4.1 with $\psi_f$ replacing $\varphi_f$, hence its proof is omitted.
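
The relationship defining the scale-based e.c.'s can be checked numerically: if $g = c\,|x|^{d-1} f^d$, then $\psi_g = d\,\psi_f$. The sketch below is only an illustration; it assumes a gamma target with shape 2 and $d = 3$, and the helper `scale_score` (a central finite difference) is hypothetical.

```python
import math

def scale_score(logdens, x, eps=1e-6):
    # psi(x) = 1 + x * d/dx log density(x), using a central finite difference.
    return 1.0 + x * (logdens(x + eps) - logdens(x - eps)) / (2.0 * eps)

def log_f(x, alpha=2.0):
    # Unnormalised log-density of the gamma law with shape alpha (x > 0).
    return (alpha - 1.0) * math.log(x) - x

def log_g(x, d=3.0, alpha=2.0):
    # log of |x|**(d-1) * f(x)**d, up to an additive constant.
    return (d - 1.0) * math.log(abs(x)) + d * log_f(x, alpha)

points = [0.5, 1.0, 2.5]
ratios_ok = all(
    abs(scale_score(log_g, x) - 3.0 * scale_score(log_f, x)) < 1e-5 for x in points
)
```

Normalizing constants play no role since only the derivative of the log-density enters the score.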

Theorem 5.1. Let $\mathcal{F}(\mathrm{sca})$ and $\mathcal{G}(\mathrm{sca})$ be two distinct scale-based e.c.'s and let
their respective representatives $f$ and $g$ be two continuously differentiable
densities with common support $S$ (either $\mathbb{R}$, $\mathbb{R}_0^+$ or $\mathbb{R}_0^-$). Let $\psi_f(x) = 1 + x f'(x)/f(x)$
be the scale score function of $f$. If $\psi_f$ is invertible over $S$ and crosses the
x-axis then there exists $N \in \mathbb{N}$ such that, for any $n \ge N$, we have $\hat\theta_f^{(n)} = \hat\theta_g^{(n)}$
for all samples of size $n$ if and only if there exist constants $c, d \in \mathbb{R}_0^+$ such
that $g(x) = c\,|x|^{d-1}(f(x))^d$ for all $x \in S$ (with $c = d = 1$ for $S = \mathbb{R}$), that
is, if and only if $\mathcal{F}(\mathrm{sca}) = \mathcal{G}(\mathrm{sca})$. The smallest integer for which this holds is
MNSS $= \max\{N_f, 3\}$, with $N_f$ the MCSS as defined in Lemma 3.1.

As in the case of a location parameter, requiring differentiability of the g's 
is not indispensable. One could indeed restrict the class of target distributions 
under consideration, as in Teicher (1961). We leave this as an easy exercise. 

When dealing with scale families it is natural to work as in Teicher (1961)
and add a scale-identification condition of the form

$$\lim_{x \to 0} g(\lambda x)/g(x) = \lim_{x \to 0} f(\lambda x)/f(x) \quad \forall \lambda > 0. \tag{15}$$

Imposing this condition in Theorem 5.1 allows, at least when the limit is finite
and positive, to deduce that $c = d = 1$ for $S = \mathbb{R}_0^+$ and $\mathbb{R}_0^-$, and hence $g = f$
in all cases. Interestingly Teicher already remarks that this "seemingly ad
hoc condition appears to be crucial"; this is clearly the case for a complete
identification of the family of densities which share a scale MLE.

The invertibility condition imposed on $\psi_f$ is as natural in a scale family
context as the invertibility condition on $\varphi_f$ in a location family setup (see
(Lehmann and Casella, 1998, page 502)). Unfortunately it suffers from one major
drawback for $S = \mathbb{R}$: requiring invertibility of $\psi_f$ over the whole real line
forces us to discard several interesting cases such as, e.g., the standard normal
density $\phi$, for which $\psi_\phi(x) = 1 - x^2$ is only invertible over the positive and
negative real half-lines, respectively. More generally, any symmetric density $f$ has
an even scale score function $\psi_f$, which can therefore be invertible at most over
each half-line separately; such densities suffer from that same problem and hence
are not characterizable by means of Theorem 5.1. This flaw is nevertheless easily
fixed, since Lemma 3.1 is applicable even if $\psi_f$ is only invertible over portions
of its support. This leads to our next general result (whose proof is omitted).

Theorem 5.2. Let $\mathcal{F}(\mathrm{sca})$ and $\mathcal{G}(\mathrm{sca})$ be two distinct scale-based e.c.'s and let
their respective representatives $f$ and $g$ be two continuously differentiable
densities with full support $\mathbb{R}$. Let the scale score function $\psi_f(x) = 1 + x f'(x)/f(x)$ be
invertible and cross the x-axis over $\mathbb{R}_0^-$ and $\mathbb{R}_0^+$, respectively. Then there exists
$N \in \mathbb{N}$ such that, for any $n \ge N$, we have $\hat\theta_f^{(n)} = \hat\theta_g^{(n)}$ for all samples of size
$n$ if and only if $g(x) = f(x)$ for all $x \in \mathbb{R}$. Moreover the MNSS is given by
$\max\{\mathrm{MNSS}_-, \mathrm{MNSS}_+\}$, where $\mathrm{MNSS}_-$ and $\mathrm{MNSS}_+$ respectively stand for the
MNSS required on each half-line.

It should be noted that the scale condition (15) is not necessary here since we
are working on the entire support $S = \mathbb{R}$ which imposes that $d = 1$ as otherwise
the non-vanishing density $g$ would vanish at 0.

Finally note that the separation of the two real half-lines is tailored for scale
families because both $\mathbb{R}_0^+$ and $\mathbb{R}_0^-$ are invariant under the action of the scale
parameter, which permits us to work on each half-line separately and put the
ends together by continuity. The same would not hold true for location families
due to a lack of invariance, that is, we could not "glue together" location
characterizations valid on complementary subsets of the support.



6. MLE characterization for one-parameter group families 

The relevance of our approach is not confined to location and scale families,
but can be used for other $\theta$-parameter families with $\theta$ neither a location nor a
scale parameter. In this section, we shall consider general one-parameter group
families and provide them with MLE characterization results. Group families
play a central role in statistics as they contain several well-known parametric
families (location, scale, several types of skew distributions as shown in
Ley and Paindaveine (2010b), ...) and allow for significant simplifications of
the data under investigation (see (Lehmann and Casella, 1998, Section 1.4) for
more details). To the best of the authors' knowledge, there exist no MLE
characterizations for group families other than the location and scale families.

A univariate group family of distributions is obtained by subjecting a scalar
random variable with a fixed distribution to a suitable family of transformations.
More concretely, let $X$ be a random variable with density $f$ defined on its
support $S$ and consider a transformation group $\mathcal{H}$ (meaning that it is closed
under both composition and inversion) of monotone increasing functions
$H_\theta : D \subseteq \mathbb{R} \to S$ depending on a single real parameter $\theta \in \Theta_0$. The family of random
variables $\{H_\theta^{-1}(X), H_\theta \in \mathcal{H}\}$ is called a group family. These variables possess
densities of the form

$$f_{\mathcal{H}}(x; \theta) := H'_\theta(x)\, f(H_\theta(x)), \tag{16}$$

where $H'_\theta$ stands for the derivative of the mapping $x \mapsto H_\theta(x)$; their support $D$
does not depend on $\theta$. We call $\theta$ an $\mathcal{H}$-parameter for $f_{\mathcal{H}}(x; \theta)$. The most prominent
examples are of course $\mathcal{H}_{\mathrm{loc}} := \{H_\theta(x) = x - \theta,\ x, \theta \in \mathbb{R}\}$, leading to location
families, and $\mathcal{H}_{\mathrm{sca}} := \{H_\theta(x) = \theta x,\ x \in S,\ \theta \in \mathbb{R}_0^+\}$ for $S = \mathbb{R}$, $\mathbb{R}_0^+$ and $\mathbb{R}_0^-$,
yielding scale families. For further examples, we refer to (Lehmann and Casella,
1998, Section 1.4) and the references therein.

Let us now determine the e.c.'s for $\mathcal{H}$-parameter families. As we shall see in
a few lines, we will need to further specify the form of $f_{\mathcal{H}}(x; \theta)$, more precisely
the way $\theta$ is acting inside the densities. Assuming that the mappings $\theta \mapsto H_\theta(x)$
and $\theta \mapsto H'_\theta(x)$ are differentiable, the $\mathcal{H}$-score function associated with densities
of the form (16) corresponds to

$$\varphi_f^{\mathcal{H}}(x; \theta) := \frac{\partial_\theta H'_\theta(x)}{H'_\theta(x)} + \partial_\theta H_\theta(x)\, \frac{f'(H_\theta(x))}{f(H_\theta(x))}$$

over $D$ (it is set to 0 outside $D$). Extracting e.c.'s from equation (3) is anything but
evident here, as (i) there is no structural reason for $\partial_\theta H_\theta(x)$ to cross the x-axis
so as to allow to fix $d$ to 1 as in the scale case over $\mathbb{R}$, and (ii) the generality of
the model hampers a clear understanding of the role of $\theta$ inside the densities.
Especially the latter point is crucial, as e.c.'s cannot depend on the parameter.
For this reason, we restrict our attention to transformations $H_\theta$ satisfying the
following two factorizations:

$$\partial_\theta H_\theta(x) = T(\theta)\, U_1(H_\theta(x)), \qquad \frac{\partial_\theta H'_\theta(x)}{H'_\theta(x)} = T(\theta)\, U_2(H_\theta(x)),$$

where $T$, $U_1$ and $U_2$ are real-valued functions. At first sight, such restrictions
might seem severe, but there exist numerous one-parameter transformations
enjoying these factorizations, including

- transformations of the form $H_\theta(x) = a_1(x) + a_2(\theta)$ defined over the entire
real line, with $a_1$ a monotone increasing differentiable function over $\mathbb{R}$ and
$a_2$ any real-valued differentiable function. These transformations lead to
"generalized location families" and satisfy the above factorizations with
$T(\theta) = a_2'(\theta)$, $U_1(x) = 1$ and $U_2(x) = 0$.

- transformations of the form $H_\theta(x) = a_1(x)\, a_2(\theta)$ defined over $\mathbb{R}$, $\mathbb{R}_0^+$ or
$\mathbb{R}_0^-$, with $a_1$ a monotone increasing differentiable function over the
corresponding domain and $a_2$ a positive real-valued differentiable function.
These transformations lead to "generalized scale families" and satisfy the
above factorizations with $T(\theta) = a_2'(\theta)/a_2(\theta)$, $U_1(x) = x$ and $U_2(x) = 1$.

- transformations of the form $H_\theta(x) = \sinh(\mathrm{arcsinh}(x) + \theta)$ defined over
$\mathbb{R}$. These are the so-called sinh-arcsinh transformations put to use in
Jones and Pewsey (2009) in order to define sinh-arcsinh distributions which
allow one to account for both skewness and kurtosis. The above factorizations are
verified for $T(\theta) = 1$, $U_1(x) = \sqrt{1 + x^2}$ and $U_2(x) = x/\sqrt{1 + x^2}$.
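
The two factorizations can be verified numerically for the sinh-arcsinh transformation. The sketch below is only a sanity check under stated assumptions: the derivative with respect to $\theta$ is approximated by a central finite difference, and all function names are hypothetical.

```python
import math

def H(x, theta):
    # Sinh-arcsinh transformation H_theta(x).
    return math.sinh(math.asinh(x) + theta)

def dH_dx(x, theta):
    # Derivative of x -> H_theta(x).
    return math.cosh(math.asinh(x) + theta) / math.sqrt(1.0 + x * x)

def U1(y):
    return math.sqrt(1.0 + y * y)

def U2(y):
    return y / math.sqrt(1.0 + y * y)

def d_theta(fun, x, theta, eps=1e-6):
    # Central finite difference in the parameter theta.
    return (fun(x, theta + eps) - fun(x, theta - eps)) / (2.0 * eps)

def factorization_errors(x, theta):
    # With T(theta) = 1, both errors should vanish up to discretisation error.
    err1 = d_theta(H, x, theta) - U1(H(x, theta))
    err2 = d_theta(dH_dx, x, theta) / dH_dx(x, theta) - U2(H(x, theta))
    return abs(err1), abs(err2)
```

For instance, `factorization_errors(0.3, 0.5)` returns two values that are zero up to the finite-difference error.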

Under these premises, equation (3) becomes

$$d\left[U_2(H_\theta(x)) + U_1(H_\theta(x))\,\frac{f'(H_\theta(x))}{f(H_\theta(x))}\right] = U_2(H_\theta(x)) + U_1(H_\theta(x))\,\frac{g'(H_\theta(x))}{g(H_\theta(x))} \quad \forall x \in D,$$

which can be rewritten as

$$d\left[U_2(x) + U_1(x)\,\frac{f'(x)}{f(x)}\right] = U_2(x) + U_1(x)\,\frac{g'(x)}{g(x)} \quad \forall x \in S.$$

This first-order differential equation admits as solution
$g(x) = c \exp\left((d - 1)\int^{x} U_2(y)/U_1(y)\, dy\right)(f(x))^d$ for some $d > 0$ (such that $g$ is integrable) and $c$ a
normalizing constant. This relationship establishes the $\mathcal{H}$-based e.c.'s. As for the
scale case, $c = d = 1$ when there exists $x_0 \in S$ such that $U_1(x_0) = 0$, yielding
e.c.'s constituted of singletons $\{f\}$.

For each transformation group $\mathcal{H}$, we obtain the following MLE characterization
theorem for one-parameter group families. The proof of this result contains
nothing new and is thus omitted.

Theorem 6.1. Let $\mathcal{F}(\mathcal{H})$ and $\mathcal{G}(\mathcal{H})$ be two distinct $\mathcal{H}$-based e.c.'s and let their
respective representatives $f$ and $g$ be two continuously differentiable densities
with common full support $S$. Let $\varphi_f^{\mathcal{H}}(x) := U_2(x) + U_1(x) f'(x)/f(x)$ be the $\mathcal{H}$-score
function of $f$. If $\varphi_f^{\mathcal{H}}$ is invertible over $S$ then there exists $N \in \mathbb{N}$ such that, for
any $n \ge N$, we have $\hat\theta_f^{(n)} = \hat\theta_g^{(n)}$ for all samples of size $n$ if and only if there exist
constants $c, d \in \mathbb{R}_0^+$ such that $g(x) = c \exp\left((d - 1)\int^{x} U_2(y)/U_1(y)\, dy\right)(f(x))^d$ for
all $x \in S$ (with $c = d = 1$ if there exists $x_0 \in S$ such that $U_1(x_0) = 0$), that
is, if and only if $\mathcal{F}(\mathcal{H}) = \mathcal{G}(\mathcal{H})$. The smallest integer for which this holds is
MNSS $= \max\{N_f, 3\}$, with $N_f$ the MCSS as defined in Lemma 3.1.

Aside from location- and scale-based (or variations thereof) characterizations, 
already available from Theorems 4.1 and 5.1, Theorem 6.1 allows, inter alia, 
to characterize asymmetric distributions (namely the sinh-arcsinh distributions 
of Jones and Pewsey (2009)) with respect to their skewness parameter. 

7. Examples 

In this section we analyze and discuss several examples of absolutely continuous 
distributions in the light of the findings of the previous sections. We indicate, in 
each case, the corresponding MNSS. As we shall see, we hereby retrieve (and get 
a better insight into) numerous existing results, and obtain several new ones. 
We stress that, in each case discussed below, the minimal sample size provided 
is optimal in the sense that counter-examples can be constructed if the results 
only are supposed to hold true for smaller sample sizes. 

For the sake of clarity, we will adopt in this section the commonly used
notations $\hat\mu_f^{(n)}$ and $\hat\sigma_f^{(n)}$ for location and scale ML estimators.

7.1. The Gaussian distribution 

Consider the Gaussian distribution whose MLE characterizations for both the
location and the scale parameter have been extensively discussed in the
literature. For $\phi$ the standard Gaussian density we get $\varphi_\phi(x) = x$ which is
invertible over $\mathbb{R}$ and has image $\mathrm{Im}(\varphi_\phi) = \mathbb{R}$. As already mentioned several times,
the MLE $\hat\mu_\phi^{(n)}$ is given by the sample arithmetic mean $\bar{x}$. Thus Theorems 4.1
and 4.2 apply, with MNSS $= 3$ since $P_\phi^- = P_\phi^+ = \infty$. The first corresponds to
(Azzalini and Genton, 2007, Theorem 1), the second to (Teicher, 1961,
Theorem 1). Regarding the scale characterization for $\hat\sigma_\phi^{(n)} = \big(n^{-1} \sum_{i=1}^n x_i^2\big)^{1/2}$, direct
calculations reveal that $\psi_\phi(x) = 1 - x^2$ which is invertible over both $\mathbb{R}_0^-$ and $\mathbb{R}_0^+$
and maps both domains onto $(-\infty, 1)$. The conditions of Theorem 5.2 are thus
fulfilled and yield that the MNSS equals $\infty$. Hence we retrieve (Teicher, 1961,
Theorem 3).
7.2. The gamma distribution

Consider the gamma distribution with tail parameter $\alpha > 0$, whose density is
given by

$$f(x) = \frac{1}{\Gamma(\alpha)}\, x^{\alpha - 1} \exp(-x)\, \mathbb{I}_{(0,\infty)}(x),$$

where $\mathbb{I}_A$ represents the indicator function of the set $A$. The exponential
density is a special case of gamma densities obtained by setting $\alpha = 1$. Gamma
distributions are not natural location families; this can be seen for instance by
considering the exponential case, where $\varphi_f(x) = 1$ and hence the location
likelihood equations make no sense. Within the framework of the current paper
we anyway do not provide a location-based MLE characterization of gamma
densities, since their support is only $\mathbb{R}_0^+$ instead of $\mathbb{R}$. On the contrary, gamma
densities allow for agreeable scale characterizations. Indeed easy computations
show that $\varphi_f(x) = (-\alpha + 1)/x + 1$, $\psi_f(x) = \alpha - x$, which is thus invertible over
$\mathbb{R}_0^+$, $\mathrm{Im}(\psi_f) = (-\infty, \alpha)$ and $\hat\sigma_f^{(n)} = \alpha^{-1}\bar{x}$. We can therefore use Theorem 5.1,
which states that the gamma distribution with shape $\alpha$ is characterizable w.r.t.
its scale MLE $\alpha^{-1}\bar{x}$ for an infinite MNSS. We hereby recover (Teicher, 1961,
Theorem 2) and the univariate case of (Marshall and Olkin, 1993, Theorem
5.1).



7.3. The generalized Gaussian distribution 

Consider the one-parameter generalization of the normal distribution proposed
in Ferguson (1962), with density

$$f(x) = \frac{|\gamma|\, \alpha^{\alpha}}{\Gamma(\alpha)} \exp\big(\alpha\gamma x - \alpha \exp(\gamma x)\big),$$

where $\alpha > 0$ and $\gamma$, the additional parameter, differs from zero (Ferguson
has proved that, for $\gamma \to 0$, this density converges to the Gaussian). This
probability law is in fact strongly related to the gamma distribution, as it is
defined as $\gamma^{-1}\log(X)$ with $X \sim \mathrm{Gamma}(\alpha)$. Now, direct calculations yield
$\varphi_f(x) = -\alpha\gamma(1 - \exp(\gamma x))$, invertible over $\mathbb{R}$, $\mathrm{Im}(\varphi_f) = \mathrm{sign}(\gamma)(-\alpha|\gamma|, \infty)$ and
$\hat\mu_f^{(n)} = \gamma^{-1} \log\big(n^{-1} \sum_{i=1}^n \exp(\gamma x_i)\big)$. Hence, from Theorem 4.1, we deduce that
these distributions can be characterized in terms of their location parameter,
with MNSS equal to $\infty$; we retrieve (Ferguson, 1962, Theorem 5). Concerning
the scale part, $\psi_f(x) = \alpha\gamma x(1 - \exp(\gamma x)) + 1$ is not invertible over the whole
real line, but invertible over both $\mathbb{R}_0^-$ and $\mathbb{R}_0^+$, and maps both half-lines onto
$(-\infty, 1)$. Consequently, Theorem 5.2 reveals that this distribution admits as
well a scale MLE characterization result, with infinite MNSS.
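
The closed form of the location MLE can be checked against the likelihood equation $\sum_i \varphi_f(x_i - \mu) = 0$. The following sketch uses hypothetical function names and an arbitrary illustrative sample; it also illustrates that, for small $\gamma$, the estimator is close to the sample mean, in line with the Gaussian limit.

```python
import math

def location_score(x, alpha, gamma):
    # phi_f(x) = -alpha * gamma * (1 - exp(gamma * x)) for the generalized Gaussian.
    return -alpha * gamma * (1.0 - math.exp(gamma * x))

def location_mle(xs, gamma):
    # Closed form: (1/gamma) * log of the mean of exp(gamma * x_i).
    return math.log(sum(math.exp(gamma * x) for x in xs) / len(xs)) / gamma

def score_sum(xs, mu, alpha, gamma):
    # Left-hand side of the location likelihood equation at mu.
    return sum(location_score(x - mu, alpha, gamma) for x in xs)

xs = [0.2, -0.5, 1.3, 0.9]  # arbitrary illustrative sample
mu_pos = location_mle(xs, 0.8)
mu_neg = location_mle(xs, -1.5)
mu_small = location_mle(xs, 1e-6)  # close to the sample mean of xs
```

Note that the estimator does not depend on $\alpha$, which only rescales the score.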

7.4. The Laplace distribution 

Consider the Laplace distribution with density 

$$f(x) = \exp(-|x|)/2.$$






One easily obtains $\varphi_f(x) = \mathrm{sign}(x)$ and $\psi_f(x) = -x\,\mathrm{sign}(x) + 1$. While the former
function is clearly not invertible at all (but allows for a location MLE
characterization; see the end of Section 4), the latter is invertible on both $\mathbb{R}_0^-$ and
$\mathbb{R}_0^+$ with $\mathrm{Im}(\psi_f) = (-\infty, 1)$. Hence Theorem 5.2 applies and reveals that the Laplace
distribution is also MLE-characterizable w.r.t. its scale parameter (with infinite
MNSS), which complements the existing results on MLE characterizations of
the Laplace distribution from Ghosh and Rao (1971); Kagan, Linnik and Rao
(1973); Marshall and Olkin (1993).

Corollary 7.1. The statistic

$$\hat\sigma_f^{(n)} = \frac{1}{n} \sum_{i=1}^n |x_i|$$

is the MLE of the scale parameter $\sigma$ within scale families over $\mathbb{R}$ for all
samples of all sample sizes if and only if the samples are drawn from a Laplace
distribution.

For the sake of readability we will, here and in the sequel, content ourselves 
with such informal statements of our characterization results; rigorous state- 
ments are straightforward adaptations of the corresponding theorems from the 
previous sections. 

7.5. The Weibull distribution 

Consider the Weibull distribution with density

$$f(x) = k\, x^{k-1} \exp(-x^k)\, \mathbb{I}_{(0,\infty)}(x),$$

where $k > 0$ is the shape parameter. As for gamma distributions, we do not
provide a location-based MLE characterization for this distribution on the
positive real half-line. Regarding the scale part, we have $\varphi_f(x) = -(k - 1)/x + k x^{k-1}$,
$\psi_f(x) = k(1 - x^k)$, clearly invertible over $\mathbb{R}_0^+$, and $\mathrm{Im}(\psi_f) = (-\infty, k)$. Thus,
all conditions for Theorem 5.1 are satisfied, from which we derive the
following, to the best of our knowledge new, MLE characterization of the Weibull
distribution.

Corollary 7.2. The statistic

$$\hat\sigma_f^{(n)} = \Big(\frac{1}{n} \sum_{i=1}^n x_i^k\Big)^{1/k}$$

is the MLE of the scale parameter $\sigma$ within scale families over $\mathbb{R}_0^+$ for all
samples of all sample sizes if and only if the samples are drawn from a Weibull
distribution with shape parameter $k$.
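
As a hedged numerical illustration (not part of the original text), solving the scale likelihood equation $\sum_i \psi_f(x_i/\sigma) = 0$ with $\psi_f(x) = k(1 - x^k)$ gives $\sigma = (n^{-1}\sum_i x_i^k)^{1/k}$. The sketch below, with hypothetical function names and the scale family taken as $\sigma^{-1} f(x/\sigma)$, checks that this value solves the equation and that it beats nearby values of $\sigma$ in terms of log-likelihood.

```python
import math

def weibull_scale_mle(xs, k):
    # Candidate closed form: k-th root of the mean of x_i**k.
    return (sum(x ** k for x in xs) / len(xs)) ** (1.0 / k)

def weibull_scale_score_sum(xs, k, sigma):
    # sum_i psi_f(x_i / sigma) with psi_f(x) = k * (1 - x**k).
    return sum(k * (1.0 - (x / sigma) ** k) for x in xs)

def weibull_loglik(xs, k, sigma):
    # Log-likelihood of the scale family (1/sigma) * f(x/sigma), f the Weibull(k) density.
    return sum(
        math.log(k / sigma) + (k - 1.0) * math.log(x / sigma) - (x / sigma) ** k
        for x in xs
    )

xs = [0.4, 1.1, 2.3, 0.8]  # arbitrary positive sample
k = 1.7                    # arbitrary shape parameter
sigma_hat = weibull_scale_mle(xs, k)
```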
7.6. The Gumbel distribution



Consider the Gumbel distribution with density 



$$f(x) = \exp(-x - \exp(-x)).$$



Straightforward manipulations yield $\varphi_f(x) = 1 - \exp(-x)$, invertible over $\mathbb{R}$
and $\mathrm{Im}(\varphi_f) = (-\infty, 1)$. Thus, all conditions for Theorem 4.1 are satisfied, from
which we derive the following, to the best of our knowledge new, MLE
characterization of the Gumbel distribution.

Corollary 7.3. The statistic

$$\hat\mu_f^{(n)} = -\log\Big(\frac{1}{n} \sum_{i=1}^n \exp(-x_i)\Big)$$

is the MLE of the location parameter $\mu$ within location families over $\mathbb{R}$ for all
samples of all sample sizes if and only if the samples are drawn from a Gumbel
distribution.

As for the scale part, it follows that $\psi_f(x) = x(-1 + \exp(-x)) + 1$,
non-invertible over $\mathbb{R}$ but invertible over both $\mathbb{R}_0^+$ and $\mathbb{R}_0^-$, and $\mathrm{Im}(\psi_f) = (-\infty, 1)$.
Consequently, Theorem 5.2 applies and shows that the Gumbel distribution
allows as well for an MLE characterization with respect to its scale parameter
(with corresponding MNSS equal to $\infty$).

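7.7. The Lévy distribution

Consider the Lévy distribution with density

$$f(x) = \frac{1}{\sqrt{2\pi}}\, x^{-3/2} \exp\Big(-\frac{1}{2x}\Big)\, \mathbb{I}_{(0,\infty)}(x).$$

Straightforward manipulations yield $\varphi_f(x) = \frac{3}{2x} - \frac{1}{2x^2}$ and $\psi_f(x) = \frac{1}{2}(-1 + 1/x)$
which is invertible over $\mathbb{R}_0^+$ and $\mathrm{Im}(\psi_f) = (-1/2, \infty)$. Theorem 5.1 applies and
yields the (to the best of our knowledge) first MLE characterization of the Lévy
distribution.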

Corollary 7.4. The statistic

$$\hat\sigma_f^{(n)} = \Big(\frac{1}{n} \sum_{i=1}^n \frac{1}{x_i}\Big)^{-1}$$

is the MLE of the scale parameter $\sigma$ within scale families over $\mathbb{R}_0^+$ for all samples
of all sample sizes if and only if the samples are drawn from a Lévy distribution.
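
A short, hedged check (hypothetical function names): with $\psi_f(x) = \frac{1}{2}(-1 + 1/x)$, the scale likelihood equation $\sum_i \psi_f(x_i/\sigma) = 0$ is solved by the harmonic mean of the observations.

```python
def levy_scale_score(x):
    # psi_f(x) = (1/2) * (-1 + 1/x) for x > 0.
    return 0.5 * (-1.0 + 1.0 / x)

def levy_scale_mle(xs):
    # Harmonic mean of the sample.
    return len(xs) / sum(1.0 / x for x in xs)

def levy_score_sum(xs, sigma):
    # Left-hand side of the scale likelihood equation at sigma.
    return sum(levy_scale_score(x / sigma) for x in xs)

xs = [0.5, 2.0, 4.0]  # arbitrary positive sample
sigma_hat = levy_scale_mle(xs)
```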
7.8. The Student distribution

Consider the Student distribution with $\nu > 0$ degrees of freedom, with density

$$f(x) = \kappa(\nu) \Big(1 + \frac{x^2}{\nu}\Big)^{-(\nu + 1)/2}$$

for $\kappa(\nu)$ the appropriate normalizing constant. Then although the support of $f$
is the whole real line, the location score function $\varphi_f(x) = (\nu + 1)\frac{x}{\nu + x^2}$ is not
invertible and thus we cannot provide a location characterization. On the other
hand straightforward computations yield

$$\psi_f(x) = 1 - (\nu + 1)\frac{x^2}{\nu + x^2} = \frac{\nu(1 - x^2)}{\nu + x^2}.$$

This function is invertible over both the positive and the negative real
half-line with $\mathrm{Im}(\psi_f) = (-\nu, 1)$ and thus the Student distribution with $\nu$ degrees of
freedom is by virtue of Theorem 5.2 scale-characterizable with

$$\mathrm{MNSS} = \begin{cases} \lceil 1 + 1/\nu \rceil & \text{if } \nu < 1, \\ 3 & \text{if } \nu = 1, \\ \lceil 1 + \nu \rceil & \text{if } \nu > 1. \end{cases}$$

This result generalizes the scale characterization of the Gaussian distribution,
which is a particular case of the Student distributions when $\nu$ tends to infinity.
Note that the expression above then indeed yields an infinite MNSS. Moreover,
the Cauchy distribution, obtained for $\nu = 1$, is MLE-characterizable w.r.t. its
scale parameter with an MNSS of 3.
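
For convenience, the case distinction can be sketched as a small helper. This assumes that the brackets in the display denote the ceiling function; the function name is hypothetical.

```python
import math

def student_scale_mnss(nu):
    # MNSS for the scale characterization of the Student law with nu > 0
    # degrees of freedom, assuming the brackets in the display are ceilings.
    if nu <= 0:
        raise ValueError("nu must be positive")
    if nu < 1:
        return math.ceil(1 + 1 / nu)
    if nu == 1:
        return 3
    return math.ceil(1 + nu)
```

The value grows without bound as `nu` increases, in line with the infinite MNSS of the Gaussian scale characterization.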

7.9. The sinh-arcsinh skew-normal distribution 

As a final example, we consider the sinh-arcsinh skew-normal distribution of
Jones and Pewsey (2009) whose density is given by

$$f(x) = \frac{1}{\sqrt{2\pi}}\, \frac{\big(1 + \sinh^2(\mathrm{arcsinh}(x) + \delta)\big)^{1/2}}{(1 + x^2)^{1/2}}\, \exp\big(-\sinh^2(\mathrm{arcsinh}(x) + \delta)/2\big),$$

where $\delta \in \mathbb{R}$ is a skewness parameter regulating the asymmetric nature of the
distribution. Clearly, for $\delta = 0$, corresponding to the symmetric situation, one
retrieves the standard normal distribution. Now, straightforward but tedious
calculations provide us with expressions for $\varphi_f$ and $\psi_f$ which can both be seen
to be non-invertible. Hence no location-based nor scale-based characterizations
can be obtained. However, the sinh-arcsinh skew-normal distribution can be
characterized w.r.t. its skewness parameter. As shown in Section 6, the
sinh-arcsinh transform belongs to the class of transforms leading to group families.
Consequently, its skewness score function is given by

$$\varphi_\phi^{\mathcal{H}}(x) = \frac{x}{\sqrt{1 + x^2}} + \sqrt{1 + x^2}\, \frac{\phi'(x)}{\phi(x)} = -\frac{x^3}{\sqrt{1 + x^2}},$$

with $\phi$ the standard Gaussian density. This mapping is invertible over $\mathbb{R}$ with
symmetric image $\mathbb{R}$. Theorem 6.1 therefore applies and yields the (to the best
of our knowledge) first MLE characterization of the sinh-arcsinh skew-normal
distribution (with respect to its skewness parameter) with an MNSS equal to 3.
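
The invertibility claim can be illustrated numerically. The sketch below (hypothetical names) evaluates the skewness score $U_2(x) + U_1(x)\phi'(x)/\phi(x)$ with $U_1(x) = \sqrt{1 + x^2}$, $U_2(x) = x/\sqrt{1 + x^2}$ and $\phi'(x)/\phi(x) = -x$, and compares it on a grid with the simplified expression $-x^3/\sqrt{1 + x^2}$.

```python
import math

def skewness_score(x):
    # U2(x) + U1(x) * phi'(x)/phi(x), with phi the standard Gaussian density.
    u1 = math.sqrt(1.0 + x * x)
    u2 = x / u1
    return u2 + u1 * (-x)

def simplified_score(x):
    # Closed form obtained by putting both terms over sqrt(1 + x**2).
    return -x ** 3 / math.sqrt(1.0 + x * x)

grid = [-5.0 + 0.25 * i for i in range(41)]  # grid from -5 to 5
values = [skewness_score(x) for x in grid]
is_decreasing = all(a > b for a, b in zip(values, values[1:]))
```

On this grid the score is strictly decreasing and odd, consistent with an invertible mapping whose image is symmetric.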

8. Discussion 

In this article we have provided a unified treatment of the topic of MLE
characterizations for a wide variety of families of absolutely continuous distributions
depending on a single one-dimensional parameter and satisfying certain
regularity conditions. A natural question of interest is then how far our methodology
can be adapted to other distributions which do not satisfy these assumptions. Of
particular interest are (i) parametric families whose score function is either not
invertible or not differentiable at a countable number of points (such as, e.g., the
Laplace distribution), (ii) families depending on more than one parameter and
(iii) discrete families. Although we will not cover these questions in full here,
we conclude the paper by providing a number of intuitions on these questions;
in all cases it seems clear that our methodology provides - at least in principle
- the path towards a satisfactory answer.

Regarding the first point, an interesting issue to investigate is how the
non-invertibility of the score function influences the MNSS. Indeed in the case of
a Laplace target the MNSS is known to be equal to 4 (see (Ghosh and Rao,
1971; Kagan, Linnik and Rao, 1973)). This increase is due to the fact that the
Laplace score function only takes on two distinct non-zero values so that having
(9) for three sample points forces one of the observations to be 0 (otherwise the
equality cannot hold) and therefore the case $n = 3$ provides no more information
than the case $n = 2$ (and thus MNSS $\ge 4$). It would of course be interesting to
understand the influence of the number of distinct values taken by a given score
function on the corresponding MNSS. One would, moreover, need to deal in this
case with commensurability issues in order for the corresponding identity (9) to
hold. Aside from these issues the question of characterizability is, to the best of
our understanding, covered by our approach (see the end of Section 4).
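
The counting argument for the Laplace score can be checked by brute force. In the sketch below (hypothetical function name) we enumerate all tuples of non-zero score values $\pm 1$ and keep those summing to zero: there are none for three observations, whereas such tuples exist for two and four observations.

```python
from itertools import product

def zero_sum_sign_tuples(n):
    # All n-tuples of non-zero Laplace score values (+1 or -1) that sum to zero.
    return [t for t in product((-1, 1), repeat=n) if sum(t) == 0]

triples = zero_sum_sign_tuples(3)     # empty: an odd number of +/-1 cannot cancel
quadruples = zero_sum_sign_tuples(4)  # the first size beyond 2 that allows cancellation
```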

Regarding the second point, it seems straightforward (but clearly requires
some care) to extend our method to a multi-dimensional location parameter, as
is already done in Marshall and Olkin (1993) for a Gaussian target density. In a
nutshell it suffices to project the now multi-dimensional location score function
onto distinct-directional unit vectors and then proceed "as in the univariate
case". On the contrary, dealing with a high-dimensional scale parameter seems
more difficult, as the scale parameter becomes a matrix-valued scatter or shape
parameter. One possibility could be to try to adapt the working scheme of
Marshall and Olkin (1993), who have been able to provide an MLE characterization for the
scatter parameter of a multinormal distribution. Along these lines a final issue
that we have not considered is that of MLE characterizations of univariate target
distributions with respect to multivariate parameters (such as the Gaussian
in terms of its two parameters $(\mu, \sigma)$). In support of our optimism for these
multivariate setups, see Duerinckx and Ley (2012) where our methodology was
successfully applied to the (perhaps more complex) case of the spherical location
parameter families.

Finally concerning the discrete setup it seems clear that our approach again
yields in principle a satisfactory answer. This point of view is further supported
by Campbell (1970), where an extension of Gauss' principle is also discussed for
discrete exponential families. In this framework, however, ascribing a general role
to the parameter is relatively artificial (what is the parameter $\lambda$ in a Poisson
distribution?) and it seems a priori difficult to obtain general results in the
spirit of those presented in Sections 4 to 6. We defer the systematic treatment
of this interesting question to later publications.